Data science is about extracting useful information from data. To use data efficiently and correctly, we must first understand the data. We then design the study according to its goal, bringing in domain knowledge along the way. In this lecture we first go through some basic exploratory data analysis (EDA) to understand the nature of the data and some plausible relationships among the variables.
Data mining tools have expanded dramatically in the past 20 years. The linear model, a simple yet powerful building block of data science, is our focus. We introduce/review linear models, emphasizing what it is we are modeling, how to use the data to estimate it, and how to understand the intrinsic variability that the resulting statistics carry.
Suggested reading: car_04_regular.csv and Cars_04.csv

Outline:
- Elements of multiple regression
- Categorical predictors
- More about multiple regression
- Summary
- Appendices
- R functions
When an automobile manufacturer designs a new car, what is the benchmark to compare with in terms of fuel efficiency? How well can we predict housing prices based on house characteristics? Multiple regression is an extension of simple regression: it explores the relationship between one variable (the response) and many other variables (predictors or features). We highlight regression methods in this lecture. It is important to understand that the effect of each predictor on the response depends on what else is included in the model. Our approach here unifies the usual multiple regression and ANOVA.
Goal of the study: How to build a fuel efficient car?
We will use the car_04_regular.csv data set to perform a case study on fuel efficiency of automobiles. Below is a quick description of all the features in our dataset.
| Feature | Description |
|---|---|
| Continent | Continent the Car Company is From |
| Horsepower | Horsepower of the Vehicle |
| Weight | Weight of the vehicle (thousand lb) |
| Length | Length of the Vehicle (inches) |
| Width | Width of the Vehicle (inches) |
| Seating | Number of Seats in the Vehicle |
| Cylinders | Number of Engine Cylinders |
| Displacement | Volume displaced by the cylinders, defined as \(\pi/4 \times bore^{2} \times stroke \times \text{Number of Cylinders}\) |
| Transmission | Type of Transmission (manual, automatic, continuous) |
Once we know the data structure we can rephrase the goal of the study. Fuel efficiency is measured in miles per gallon, \(Y\) = MPG_City. We would like to:
- Quantify the effect of each predictor on \(Y\)
- Estimate the mean MPG_City for all cars with a given set of features
- Predict \(Y\) for a particular car
- Ask: are cars built by Asian manufacturers more efficient?

In particular, we would like to investigate the MPG_City for this newly designed American car:
| Feature | Value |
|---|---|
| Continent | America |
| Horsepower | 225 |
| Weight | 4 |
| Length | 180 |
| Width | 80 |
| Seating | 5 |
| Cylinders | 4 |
| Displacement | 3.5 |
| Transmission | automatic |
It is always crucial to look at the data first. To keep the lecture short, we put this part in Appendix 1, which explains how we cleaned the original data Car_04.csv and created a new version called car_04_regular.csv.
To summarize the EDA quickly: we excluded some high-performance cars with large Horsepower (HP > 400), and we took a subset of features, keeping only those listed in the description table.
data1 <- read.csv("data/car_04_regular.csv", header=TRUE)
names(data1)
## [1] "Make.Model" "Continent" "MPG_City" "MPG_Hwy"
## [5] "Horsepower" "Weight" "Length" "Width"
## [9] "Seating" "Cylinders" "Displacement" "Make"
## [13] "Transmission"
dim(data1) # 226 cars and 13 variables
## [1] 226 13
str(data1)
## 'data.frame': 226 obs. of 13 variables:
## $ Make.Model : Factor w/ 226 levels "Acura_MDX","Acura_NSX",..: 3 5 6 4 2 1 7 8 9 10 ...
## $ Continent : Factor w/ 3 levels "Am","As","E": 2 2 2 2 2 2 3 3 3 3 ...
## $ MPG_City : int 18 20 23 25 17 17 20 18 17 16 ...
## $ MPG_Hwy : int 24 28 32 34 24 23 28 25 24 22 ...
## $ Horsepower : int 225 270 200 160 252 265 170 220 330 250 ...
## $ Weight : num 3.9 3.58 3.32 2.77 3.2 ...
## $ Length : num 197 189 183 172 174 ...
## $ Width : num 71.6 72.2 69.4 67.9 71.3 77 76.3 76.1 74.6 76.1 ...
## $ Seating : int 5 5 5 4 2 7 5 5 5 5 ...
## $ Cylinders : int 6 6 4 4 6 6 4 6 6 6 ...
## $ Displacement: num 3.5 3.2 2.4 2 3 3.5 1.8 3 4.2 2.7 ...
## $ Make : Factor w/ 38 levels "Acura","Audi",..: 1 1 1 1 1 1 2 2 2 2 ...
## $ Transmission: Factor w/ 3 levels "automatic","cont_variable",..: 1 1 1 1 1 1 1 1 1 1 ...
head(data1) # View(data1)
## Make.Model Continent MPG_City MPG_Hwy Horsepower Weight Length Width
## 1 Acura_RL As 18 24 225 3.90 197 71.6
## 2 Acura_TL As 20 28 270 3.58 189 72.2
## 3 Acura_TSX As 23 32 200 3.32 183 69.4
## 4 Acura_RSX As 25 34 160 2.77 172 67.9
## 5 Acura_NSX As 17 24 252 3.20 174 71.3
## 6 Acura_MDX As 17 23 265 4.45 189 77.0
## Seating Cylinders Displacement Make Transmission
## 1 5 6 3.5 Acura automatic
## 2 5 6 3.2 Acura automatic
## 3 5 4 2.4 Acura automatic
## 4 4 4 2.0 Acura automatic
## 5 2 6 3.0 Acura automatic
## 6 7 6 3.5 Acura automatic
We create a new car for prediction later as follows.
newcar <- data1[1, ]          # create a new row with the same structure as data1
newcar$Make.Model <- NA       # unknown for the new car
newcar$Continent <- "Am"      # assign features for the new car
newcar$MPG_City <- NA         # unknown: this is what we will predict
newcar$MPG_Hwy <- NA
newcar[5:11] <- c(225, 4, 180, 80, 5, 4, 3.5)  # Horsepower through Displacement
newcar$Make <- NA
newcar$Transmission <- "automatic"
newcar
## Make.Model Continent MPG_City MPG_Hwy Horsepower Weight Length Width
## 1 NA Am NA NA 225 4 180 80
## Seating Cylinders Displacement Make Transmission
## 1 5 4 3.5 NA automatic
How does Length affect MPG_City?
It depends on how we model the response. We will investigate three models with Length.
For the ease of presentation, we define some predictors below that we will use in subsequent models:
\[x_1 = Length, \quad x_2 = Horsepower, \quad x_3 = Width, \quad x_4 = Seating, \quad x_5 = Cylinders, \quad x_6 = Displacement\]
We will create three models:
M1. Our first model will only contain one predictor, Length:
\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \epsilon\]
Interpretation of \(\beta_1\): in general, the mean of \(y\) changes by \(\beta_1\) when a car is 1" longer. Nothing else is held fixed here, so we cannot really peel the effect of Length on \(y\) apart from the effects of other features that vary with Length.
M2. Next, we add the predictor Horsepower to our model
\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \epsilon\]
Interpretation of \(\beta_1\): in general, the mean of \(y\) changes by \(\beta_1\) when a car is 1" longer and Horsepower is held the same.
M3. Finally, we fit a model with multiple predictors
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{i4} + \beta_5x_{i5} + \beta_6x_{i6} + \epsilon\]
Interpretation of \(\beta_1\) is that in general, the mean \(y\) will change by \(\beta_1\) if a car is 1" longer and the rest of the features are the same.
Question: Are the \(\beta_1\)'s the same in the 3 models above?
No. The effect of Length, \(\beta_1\), depends on the rest of the features in the model!
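This can be checked numerically on any data set. A quick sketch using R's built-in mtcars data (not the course data), regressing mpg on weight with and without horsepower:

```r
# The coefficient of wt changes once hp enters the model,
# because wt and hp are correlated predictors.
b.alone   <- coef(lm(mpg ~ wt,      data = mtcars))["wt"]
b.with.hp <- coef(lm(mpg ~ wt + hp, data = mtcars))["wt"]
c(b.alone, b.with.hp)   # same variable, two different "effects"
```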
In general, we define a multiple regression as
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\]
\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} \]
\[\textbf{Var}(y_i | x_{i1}, x_{i2}, \dots, x_{ip}) = \sigma^2\]
\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] This is the same as saying the errors are independent and normally distributed with mean \(0\) and equal variance \(\sigma^2\).
The parameters of interest are all the \(\beta\)'s and \(\sigma^2\). The appeal of linear models lies in:
- Simple interpretations of each \(\beta_i\)
- Simple models for predictions
These \(\beta\) parameters are estimated using the same approach as simple regression, specifically by minimizing the sum of squared residuals (\(RSS\)):
\[\min_{b_0,\,b_1,\,b_{2},\dots,b_{p}} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_p x_{ip})^{2}\]
To be specific, let \(X\) denote the \(n \times (p+1)\) design matrix whose first column is all \(1\)'s and whose remaining entries are the \(x_{ij}\). That is:
\[X= \begin{pmatrix} 1& x_{11} & x_{12} & \dots & x_{1p} \\ 1& x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1& x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}\]
Mathematical formula for the OLS estimates:
The least squares estimator can be obtained in closed form: \[\hat\beta = (X^{T}X)^{-1} X^{T}Y \]
Variance of the OLS:
The variance-covariance matrix of \(\hat \beta\)
\[\textbf{V}[\hat\beta | X] = \sigma^{2} (X^{T}X)^{-1}\] Remark:
The above covariance matrix relies on the homoscedastic (equal) variance of the \(y\)'s.
Normality of the OLS:
Under all the linear model assumptions, the least squares estimators are unbiased for \(\beta\) and they are normally distributed: \[ \hat\beta \mid X \sim N\left(\beta,\; \sigma^{2} (X^{T}X)^{-1}\right)\]
Standard errors
Once we know how to estimate the common \(\sigma^2\), we can estimate \(\textbf{V}[\hat\beta | X]\) by \(\hat {\textbf{V}}[\hat\beta | X] = \hat \sigma^{2} (X^{T}X)^{-1}\). The standard error of each \(\hat \beta_i\) is the square root of the \(i\)th diagonal entry: \[\text {se}\, (\hat \beta_i) = \sqrt{\left[ \hat \sigma^{2} (X^{T}X)^{-1} \right]_{ii}}\]
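These matrix formulas can be checked directly against lm(). Below is a minimal sketch using R's built-in mtcars data (not the course data); the names fit, X, y are chosen for illustration:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)                    # n x (p+1), first column all 1's
y <- mtcars$mpg
beta.hat <- solve(t(X) %*% X, t(X) %*% y) # (X'X)^{-1} X'y

n <- nrow(X); p <- ncol(X) - 1
sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - p - 1)   # MSE
se <- sqrt(diag(sigma2.hat * solve(t(X) %*% X)))          # ith diagonal -> se(beta_i)

cbind(beta.hat, se)   # matches coef(fit) and the summary(fit) table
```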
Confidence interval for each \(\beta_i\):
A \(95\%\) t-interval for \(\beta_i\) is \[\hat \beta_i \pm t^*_{.025} \text{se}(\hat \beta_i).\]
Here \(t^*_{.025}\) is the upper .025 quantile of the \(t\) distribution with \(n-p-1\) degrees of freedom. You may take it to be approximately \(1.96\) if the number of observations \(n\) is large.
\(\text{se}(\hat \beta_i)\) is taken from the variance formula above.
It is a t-confidence interval because we need to estimate the unknown \(\sigma^2\).
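A sketch of this t-interval computed by hand and checked against R's confint(), again using the built-in mtcars data rather than the course data:

```r
# 95% t-interval: beta.hat_i +/- t*_{.025} * se(beta.hat_i)
fit   <- lm(mpg ~ wt + hp, data = mtcars)
est   <- coef(fit)
se    <- summary(fit)$coefficients[, "Std. Error"]
tstar <- qt(0.975, df = fit$df.residual)   # upper .025 quantile of t_{n-p-1}
manual.ci <- cbind(est - tstar * se, est + tstar * se)
confint(fit)   # same intervals
```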
Hypothesis test for each \(\beta_i\):
To test \[H_0: \beta_i = 0 \;\;\; \text{vs.}\;\;\; H_1: \beta_i \neq 0,\]
i.e., that given the other variables in the model there is no \(x_i\) effect, we carry out a t-test:
\[ t_{\text{stat}} = \frac{\hat \beta_i - 0}{\text{se} (\hat \beta_i)}\] The p-value is \[\text{p-value} = 2 \times P(T_{n-p-1} > |t_{\text{stat}}|).\]
We reject the null hypothesis at an \(\alpha\) level if the p-value is < \(\alpha\).
A \(95\%\) Confidence interval for the mean given a set of predictors:
\[\hat y \pm \,\,t^*_{.025}\,\, \times se(\hat y) \]
A \(95\%\) prediction interval for a future \(y\) given a set of predictors is approximately \[\hat y \pm \,\,t^*_{.025}\,\, \times \hat \sigma.\] (The exact interval also accounts for the uncertainty in \(\hat y\) itself.)
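Both intervals are produced by predict(). A sketch on R's built-in mtcars data, with a hypothetical new car (wt = 3, hp = 150 are made-up values):

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)
newx <- data.frame(wt = 3, hp = 150)               # hypothetical new car
ci <- predict(fit, newx, interval = "confidence")  # for the mean response
pi <- predict(fit, newx, interval = "prediction")  # for one future response
rbind(confidence = ci, prediction = pi)            # same center; PI is wider
```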
For multiple regression, \(RSS\) is computed as:
\[RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^{2} = \sum_{i=1}^{n} (y_i-\hat y_i)^2= \sum_{i=1}^{n} (y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \dots + \hat\beta_p x_{ip}))^{2}\]
\(\sigma^2\) is estimated by the mean squared error (MSE). For multiple regression, MSE is defined as:
\[MSE = \frac{RSS}{n-p-1} \\ \text{where }p =\text{number of predictors}\]
\(\sigma\) is estimated by the Residual Standard Error (RSE). For multiple regression, RSE is defined as:
\[\hat \sigma= RSE = \sqrt{MSE} = \sqrt{\frac{RSS}{n-p-1}}\] Remark:
Notice the denominator in \(MSE\) is \(n-(p+1)\), where \(p+1\) counts the \(p\) predictors plus the intercept. By doing so we obtain a nice mathematical property: \(MSE\) is an unbiased estimator of \(\sigma^2\). In other words, in the long run, on average the estimator hits the quantity being estimated.
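These quantities can be computed by hand and checked against what lm() reports as the residual standard error. A sketch on R's built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
RSS <- sum(residuals(fit)^2)
n <- nobs(fit); p <- length(coef(fit)) - 1   # p predictors plus intercept
MSE <- RSS / (n - p - 1)
RSE <- sqrt(MSE)
c(RSE = RSE, sigma.hat = sigma(fit))   # agrees with summary(fit)'s RSE
```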
Total Sum of Squares (TSS):
TSS measures the total variability in the response \(Y\): it would be \(0\) only if all the responses were identical, in which case no predictors could affect the response. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.
\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^{2}\]
\(R^2\):
How much variability is captured in the linear model using this set of predictors? \(R^{2}\) measures the proportion of variability in \(Y\) that can be explained using this set of predictors in the model.
\[R^{2} = \frac{TSS - RSS}{TSS}\] Remark 1:
\(TSS \geq RSS\), so \(0 \le R^2 \leq 1\). Why? Because of the decomposition \(TSS = RSS + \sum_i (\hat y_i - \bar y)^2\). One can also show that \(R^2 = (\text{corr}(y, \hat y))^2\).
An \(R^{2}\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
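The three characterizations of \(R^2\) above can be verified numerically; a sketch on R's built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg
TSS <- sum((y - mean(y))^2)
RSS <- sum(residuals(fit)^2)
R2  <- (TSS - RSS) / TSS
# all three coincide: definition, summary()'s value, squared correlation
c(R2, summary(fit)$r.squared, cor(y, fitted(fit))^2)
```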
Remark2:
How large does \(R^2\) need to be before you are comfortable using the linear model? Though \(R^2\) is a very popular notion of goodness of fit, it has its limitations. All the sums of squared errors defined so far are training errors: they only measure how well a model fits the data used to build it, and that fit may not generalize to unseen data.
Linear models are popular for their clean interpretations as well as closed-form solutions. The R function lm() does the work: it takes a model specification together with other options and outputs all the estimators and summary statistics, such as the various sums of squares, standard errors of the estimators, test statistics and p-values. Finally, predicted values with margins of error (confidence intervals and prediction intervals) can be obtained via predict().
Linear model effects or coefficients are defined within the model, they depend on the rest of the variables in the model. Let us compare the three model fits, focusing on the interpretations over the coefficients of length in different models.
We first do a simple regression. Let the response \(y_i\) be the MPG_City and the explanatory variable \(x_{i1}\) be Length (\(i = 1, \dots, n=226\)).
\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \epsilon\]
Model assumptions:
- Linearity: given Length, the mean of MPG_City is \(\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\)
- Equal variance: the variances of MPG_City are the same for any Length, i.e. \(\textbf{Var}(y_i | x_{i1}) = \sigma_{1}^2\)
- Normality: the errors in MPG_City are independent and normally distributed, i.e. \(\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_{1}^2)\)

We now fit the model with lm():
fit1 <- lm(MPG_City ~ Length, data = data1) # model specification response ~ x1,..
ggplot(data1, aes(x = Length , y = MPG_City)) +
geom_point() +
geom_smooth(method="lm",se=F) +
geom_hline(aes(yintercept = mean(MPG_City)), color = "red")
## `geom_smooth()` using formula 'y ~ x'
Note from the summary below that \(\hat\beta\) for Length is estimated as \(-0.1398\). We say that on average MPG_City drops by \(0.1398\) when a car is \(1\)'' longer.
fit1 <- lm(MPG_City ~ Length, data = data1) # model one
summary(fit1)
##
## Call:
## lm(formula = MPG_City ~ Length, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.626 -2.279 -0.151 1.977 10.731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.3138 2.8975 15.64 <2e-16 ***
## Length -0.1398 0.0155 -9.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.18 on 224 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.263
## F-statistic: 81.2 on 1 and 224 DF, p-value: <2e-16
# predict(fit1, newcar, interval ="prediction", se.fit = TRUE)
# TSS <- sum((data1$MPG_City - mean(data1$MPG_City))^2)
# RSS <- sum((fit1$res)^2); RSS
# model diagnosese
# plot(fit1$fit, fit1$res); same as plot(fit1, 1) # linearity and homoscedasticity
# qqnorm(fit1$res); qqline(fit1$res) # looking for a well fitted straight line
# plot(fit1, 2)

MPG_City ~ Length + Horsepower

Let the response \(y_i\) be the MPG_City and the explanatory variables be Length and Horsepower (\(i = 1, \dots, n=226\)).
\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \epsilon\]
fit2 <- lm(MPG_City ~ Length + Horsepower, data = data1)
stargazer::stargazer(fit2, type=output_format, align=TRUE)

| Dependent variable: | |
| MPG_City | |
| Length | -0.062*** |
| (0.013) | |
| Horsepower | -0.037*** |
| (0.003) | |
| Constant | 38.600*** |
| (2.230) | |
| Observations | 226 |
| R2 | 0.591 |
| Adjusted R2 | 0.587 |
| Residual Std. Error | 2.380 (df = 223) |
| F Statistic | 161.000*** (df = 2; 223) |
| Note: | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 |
A couple quick notes:
The call to lm() is the same regardless of how many predictors are used.

summary(fit2) # sum((fit2$res)^2)
##
## Call:
## lm(formula = MPG_City ~ Length + Horsepower, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.152 -1.558 0.154 1.492 8.563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.62587 2.22525 17.36 < 2e-16 ***
## Length -0.06191 0.01300 -4.76 3.5e-06 ***
## Horsepower -0.03690 0.00277 -13.31 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.38 on 223 degrees of freedom
## Multiple R-squared: 0.591, Adjusted R-squared: 0.587
## F-statistic: 161 on 2 and 223 DF, p-value: <2e-16
NOTICE: Comparing fit2 (Length + Horsepower) to fit1 (Length):
The coefficient of Length changed from \(-0.1398\) to \(-0.0619\)! The effects of variables (the \(\beta\)'s) are defined within the model: they all depend on what else is included or accounted for.
We now add multiple variables to our model:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{i4} + \beta_5x_{i5} + \beta_6x_{i6} + \epsilon\]
fit3 <- lm(MPG_City ~ Length + Horsepower + Width + Seating +
Cylinders + Displacement, data = data1)
summary(fit3) ##
## Call:
## lm(formula = MPG_City ~ Length + Horsepower + Width + Seating +
## Cylinders + Displacement, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.877 -1.462 0.073 1.149 8.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.63372 4.02793 11.33 < 2e-16 ***
## Length 0.04909 0.01730 2.84 0.005 **
## Horsepower -0.02000 0.00408 -4.90 1.8e-06 ***
## Width -0.35358 0.06879 -5.14 6.1e-07 ***
## Seating -0.24135 0.14955 -1.61 0.108
## Cylinders -0.27169 0.23292 -1.17 0.245
## Displacement -0.93813 0.37166 -2.52 0.012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.04 on 219 degrees of freedom
## Multiple R-squared: 0.705, Adjusted R-squared: 0.697
## F-statistic: 87.2 on 6 and 219 DF, p-value: <2e-16
# shift+command+c (RStudio): comment/uncomment the selected lines
# confint(fit3)
# 2*pnorm(2.84, lower.tail = FALSE) # pvalues
# newcar
# predict(fit3, newcar)
# RSS <- sum((fit3$residual)^2)
# MSE <- RSS/219
# RSE <- sqrt(MSE)

Based on the output from model 3:
\(\hat\beta_{Length}\) = 0.049 for model 3. Is this estimate wrong?
To summarize, the effects of Length from the above three models are
data.frame(Model.1 = coef(fit1)[2], Model.2 = coef(fit2)[2], Model.3 = coef(fit3)[2])
##        Model.1 Model.2 Model.3
## Length   -0.14 -0.0619  0.0491
# process the lm output into a table
stargazer(fit1, fit2, fit3, type = output_format,
# ci=TRUE, ci.level = .95,
keep.stat = c("n", "rsq", "sigma2", "ser"))

| Dependent variable: |||
| MPG_City | |||
| (1) | (2) | (3) | |
| Length | -0.140*** | -0.062*** | 0.049*** |
| (0.016) | (0.013) | (0.017) | |
| Horsepower | -0.037*** | -0.020*** | |
| (0.003) | (0.004) | ||
| Width | -0.354*** | ||
| (0.069) | |||
| Seating | -0.241 | ||
| (0.150) | |||
| Cylinders | -0.272 | ||
| (0.233) | |||
| Displacement | -0.938** | ||
| (0.372) | |||
| Constant | 45.300*** | 38.600*** | 45.600*** |
| (2.900) | (2.230) | (4.030) | |
| Observations | 226 | 226 | 226 |
| R2 | 0.266 | 0.591 | 0.705 |
| Residual Std. Error | 3.170 (df = 224) | 2.380 (df = 223) | 2.040 (df = 219) |
| Note: | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 ||
Given a fixed model with several predictors, say model 3, assume all the linear model assumptions are met. Let us walk through the possible outcomes and point out the limitations.
Fit the model:
fit3 <- lm(MPG_City ~ Length + Horsepower + Width + Seating +
Cylinders + Displacement, data = data1)
summary(fit3) # key output
##
## Call:
## lm(formula = MPG_City ~ Length + Horsepower + Width + Seating +
## Cylinders + Displacement, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.877 -1.462 0.073 1.149 8.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.63372 4.02793 11.33 < 2e-16 ***
## Length 0.04909 0.01730 2.84 0.005 **
## Horsepower -0.02000 0.00408 -4.90 1.8e-06 ***
## Width -0.35358 0.06879 -5.14 6.1e-07 ***
## Seating -0.24135 0.14955 -1.61 0.108
## Cylinders -0.27169 0.23292 -1.17 0.245
## Displacement -0.93813 0.37166 -2.52 0.012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.04 on 219 degrees of freedom
## Multiple R-squared: 0.705, Adjusted R-squared: 0.697
## F-statistic: 87.2 on 6 and 219 DF, p-value: <2e-16
# shift+command+c (RStudio): comment/uncomment the selected lines
# confint(fit3)
# 2*pnorm(2.84, lower.tail = FALSE) # pvalues
# newcar
# predict(fit3, newcar)
# RSS <- sum((fit3$residual)^2)
# MSE <- RSS/219
# RSE <- sqrt(MSE)

## clean summary table and arrange it by tstat
fit3.summary <- summary(fit3) #names(summary(fit3))
data.frame(fit3.summary %>% coefficients) %>%
rename(estimate = Estimate,
se = "Std..Error",
tstat = "t.value",
pvalue = "Pr...t..") %>%
mutate(abs.tstat = abs(tstat)) %>%
arrange(desc(abs.tstat)) %>%
knitr::kable()

Questions of interest:
Students please fill in the answers with output from R.
Write down the final OLS equation of MPG_City given the rest of the predictors.
What is the standard error from the output? Precisely what does it measure?
Interpret the \(R^2\) reported for this model. Do you feel comfortable using the output for the following questions based on this \(R^2\) value?
What does each t-interval and t-test do?
Can you conclude that with \(95\%\) confidence all the t confidence intervals provided above are right simultaneously?
Is Width THE most important variable, Horsepower the second, etc., since they have the smallest p-values?
Is Width the most useful variable because its coefficient is the largest in magnitude?
Can we conclude that neither Seating nor Cylinders is needed?
Based on Model 3, the mean MPG_City among all cars with the same features as the new design (length=180, HP=225, width=80, seating=5, cylinders=4, displacement=3.5, transmission="automatic", continent="Am") is
\[\hat{y} = 45.634 + 0.0491 \times 180 - 0.0200 \times 225 - 0.3536 \times 80 - 0.2414 \times 5 - 0.2717 \times 4 - 0.9381 \times 3.5 \approx 16.11, \] with a standard error for the regression line of 0.784.
We will get the prediction, standard error and the confident intervals by predict().
predict(fit3, newcar, interval = "confidence", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 16.1 14.6 17.7
##
## $se.fit
## [1] 0.784
##
## $df
## [1] 219
##
## $residual.scale
## [1] 2.04
Q: What assumptions are needed to make this a valid confidence interval?
Based on Model 3, the MPG_City for this particular new design is
\[\hat{y} = 45.634 + 0.0491 \times 180 - 0.0200 \times 225 - 0.3536 \times 80 - 0.2414 \times 5 - 0.2717 \times 4 - 0.9381 \times 3.5 \approx 16.11\] with a 95% prediction interval of approximately
\[\hat{y} \pm 2\times RSE = 16.11 \pm 2 \times 2.04.\]
The prediction interval can be obtained by predict().
predict(fit3, newcar, interval = "predict", se.fit = TRUE) # future prediction interval
## $fit
## fit lwr upr
## 1 16.1 11.8 20.4
##
## $se.fit
## [1] 0.784
##
## $df
## [1] 219
##
## $residual.scale
## [1] 2.04
Q: What assumptions are needed to make this a valid prediction interval?
To check whether the model assumptions are met, we examine the residual plot and the qqplot (normal quantile plot) of the residuals.
We use the first and second plots of plot(fit).
par(mfrow=c(1,2), mar=c(5,2,4,2), mgp=c(3,0.5,0)) # plot(fit3) produces several plots
plot(fit3, 1, pch=16) # residual plot
abline(h=0, col="blue", lwd=2)
plot(fit3, 2) # qqplot

Are the linear model assumptions met for the model fitted here (fit3)? What might be violated?
- Linearity?
- Equal variances?
- Normality?
With multiple variables, we often need to test whether a set of \(\beta\)'s is 0. For example, we may ask whether Seating and Cylinders are useful given the other predictors. Mathematically, we test the null hypothesis
\[H_0: \beta_\text{Seating}=\beta_\text{Cylinder}=0 \mbox{ vs. } H_1: \text{at least one of } \beta_\text{Seating},\beta_\text{Cylinder} \text{ is non-zero}\]
To put it in another way, \(H_0\) is the reduced model:
\[y_i = \beta_0 + \beta_{\text{Length}} x_{\text{i,Length}} + \beta_{\text{HP}}x_{\text{i,HP}} + \beta_{\text{Width}}x_{\text{i, Width}} + \beta_{\text{Displacement}}x_{\text{i, Displacement}} + \epsilon_i\]
and \(H_1\) is the full model:
\[y_i = \beta_0 + \beta_{\text{Length}} x_{\text{i,Length}} + \beta_{\text{HP}}x_{\text{i,HP}} + \beta_{\text{Width}}x_{\text{i, Width}} + \\ \beta_{\text{Seating}}x_{\text{i, Seating}} + \beta_{\text{Cylinders}}x_{\text{i, Cylinders}} + \beta_{\text{Displacement}}x_{\text{i, Displacement}} + \epsilon_i\]
Theorem: Assume the linear model assumptions hold, then under the \(H_0\)
\[F_{\text{stat}}:=\frac{\frac{RSS(H_0)-RSS(H_1)}{df_1}}{\frac{RSS(H_1)}{df_2}} \sim F_{df_1, df_2}\] where \[\begin{split} {df}_1 &= \text{\# of parameters in } H_1 - \text{\# of parameters in } H_0 \\ {df}_2 &= \text{\# of observations} - \text{\# of parameters in } H_1 \end{split}.\]
The p-value is
\[\text{p-value}=P(F_{{df}_1, {df}_2} > F_{\text{stat}})\]
To explore, we can run a model without Seating and Cylinder.
fit3.0 <- lm(MPG_City ~ Horsepower + Length + Width + Displacement, data = data1) # under H0
sum((summary(fit3.0)$residuals)^2) # RSS(H_0)
sum((summary(fit3)$residuals)^2) # RSS(H_1)

\[F_{\text{stat}}=\frac{\frac{RSS(H_0)-RSS(H_1)}{df_1}}{\frac{RSS(H_1)}{df_2}} = \frac{\frac{923.82 - 907.76}{2}}{\frac{907.76}{219}}=1.9373.\]
Based on the \(F\)-test given above, the \(p\)-value is \[P(F_{2,219} > 1.9373)\]
pf(1.9373, 2, 219, lower.tail=F)
## [1] 0.147
# hist(rf(10000, 2, 219), breaks=40) # take a look at the F distribution

We can use the anova() function to carry out the above F-test: anova(fit_H0, fit_H1).
fit3.0 <- lm(MPG_City ~ Horsepower + Length + Width + Displacement, data = data1)
anova(fit3.0, fit3)
## Analysis of Variance Table
##
## Model 1: MPG_City ~ Horsepower + Length + Width + Displacement
## Model 2: MPG_City ~ Length + Horsepower + Width + Seating + Cylinders +
## Displacement
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 221 924
## 2 219 908 2 16.1 1.94 0.15
Conclusion: the p-value of 0.147 gives us no evidence to reject the null hypothesis; given the other four predictors, we do not need Seating or Cylinders.
Questions based on the summary for fit3
What is the \(F\) statistic from summary(fit3)?
Can you perform a \(F\)-test for each \(\beta\) being 0? What is the relationship between the F-test and the t-test?
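As a hint for the last question, for a single coefficient the nested-model \(F\) statistic equals the squared \(t\) statistic from the full-model summary. A sketch on R's built-in mtcars data (not the course data):

```r
# Testing one beta: the partial F-test equals t^2 from summary().
fit.full    <- lm(mpg ~ wt + hp, data = mtcars)
fit.reduced <- lm(mpg ~ wt, data = mtcars)        # H0: drop hp only
t.hp <- summary(fit.full)$coefficients["hp", "t value"]
F.hp <- anova(fit.reduced, fit.full)$F[2]
c(F = F.hp, t.squared = t.hp^2)                   # the two agree
```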
TSS <- sum((data1$MPG_City - mean(data1$MPG_City))^2) # (226-1)*var(data1$MPG_City)
TSS

Roughly speaking, the linearity, homoscedasticity and normality assumptions are satisfied. (Very subjective!)
Let us use Continent as a predictor. It has three categories. We explore the following question: does Continent affect MPG_City?

levels(data1$Continent) # data1$Continent
## [1] "Am" "As" "E"
A model with a single categorical predictor is termed a "one-way ANOVA".
First, we get the sample mean, sample standard deviation and size of each group, and plot the groups.
data1 %>%
group_by(Continent) %>%
summarise(
mean = mean(MPG_City),
sd = sd(MPG_City),
n = n()
)
## # A tibble: 3 x 4
## Continent mean sd n
## * <fct> <dbl> <dbl> <int>
## 1 Am 18.7 2.99 74
## 2 As 20.2 4.20 98
## 3 E 18.3 3.22 54
ggplot(data1) + geom_boxplot(aes(x = Continent, y = MPG_City))

We use indicator predictors to analyze the effect of Continent.
\[y_{i|x=Am} = \mu_{Am} + \epsilon_i\] \[y_{i|x=As} = \mu_{As} + \epsilon_i\] \[y_{i|x=E} = \mu_{E} + \epsilon_i\] where \(\mu_{Am}\), \(\mu_{As}\) and \(\mu_{E}\) are the mean MPG of each continent and \(\epsilon \sim \mathcal{N}(0,\sigma^2)\). We want to compare the three means and this can be done through linear model with indicator functions.
We are interested in the following hypotheses:
\[H_0: \mu_{Am} = \mu_{As} = \mu_{E} \quad \text{vs.} \quad H_1: \text{not all three means are equal}\]
Let
Am as the base
\(x_1\) be the indicator function of being As, i.e. \(I\{Continent = As\}\)
\(x_2\) be the indicator function of being E \(I\{Continent = E\}\)
Then the ANOVA model above is the same as
\[y_i = \beta_{Am} + \beta_{As} x_{1,i} + \beta_E x_{2,i} + \epsilon_i = \beta_{Am} + \beta_{As} I\{Continent = As\} + \beta_E I\{Continent = E\} + \epsilon_i\] where
- \(\beta_{Am} = \textbf{E}(MPG|Am)\): the mean for the base level, America
- \(\beta_{As} = \textbf{E}(MPG|As)-\textbf{E}(MPG|Am)\): the increment of Asia over America
- \(\beta_E = \textbf{E}(MPG|E)-\textbf{E}(MPG|Am)\): the increment of Europe over America
Testing \[H_0: \mu_{Am} = \mu_{As} = \mu_{E}\] is the same as testing \[H_0: \beta_{As} = \beta_{E} = 0\]
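The equivalence between the group means and the indicator-coded coefficients can be checked numerically. A sketch on R's built-in mtcars data (not the course data), using cyl as the 3-level category:

```r
# One-way ANOVA via lm(): intercept = base-level mean,
# other coefficients = increments over the base level.
d <- transform(mtcars, cyl = factor(cyl))   # levels "4", "6", "8"; "4" is base
fit <- lm(mpg ~ cyl, data = d)
grp.means <- tapply(d$mpg, d$cyl, mean)
coef(fit)   # (Intercept) = mean("4"); cyl6 = mean("6") - mean("4"); etc.
```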
Now we can use lm() to fit the model and carry out the tests
fit.continent <- lm(MPG_City ~ Continent, data1)
summary(fit.continent) # confint(fit.continent); Anova(fit.continent); anova(fit.continent)
##
## Call:
## lm(formula = MPG_City ~ Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.259 -2.730 -0.245 1.755 12.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.730 0.420 44.62 <2e-16 ***
## ContinentAs 1.515 0.556 2.72 0.0069 **
## ContinentE -0.470 0.646 -0.73 0.4673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared: 0.0552, Adjusted R-squared: 0.0467
## F-statistic: 6.52 on 2 and 223 DF, p-value: 0.00178
The \(F\)-test here tests \(H_0: \; \beta_{As} = \beta_{E} = 0\), i.e., there is no difference among the three regions. We reject \(H_0\) at \(\alpha=0.01\) and conclude that the mean MPG's differ among the three regions.
We can also see from the coding here that Am is the base; each coefficient captures the effect of its level vs. Am.
model.matrix(fit.continent)[5:10, ]

Remarks: fit.continent can also be obtained from
fit.continent.1 <- lm(MPG_City ~ I(Continent == "Am") + I(Continent == "E") , data1)
summary(fit.continent.1)
model.matrix(fit.continent.1)[5:10, ]

We also show a model without the intercept \(\beta_0\). Can you interpret the summary table below?
fit.continent.2 <- lm(MPG_City ~ 0 + Continent , data1) #
summary(fit.continent.2)
##
## Call:
## lm(formula = MPG_City ~ 0 + Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.259 -2.730 -0.245 1.755 12.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ContinentAm 18.730 0.420 44.6 <2e-16 ***
## ContinentAs 20.245 0.365 55.5 <2e-16 ***
## ContinentE 18.259 0.491 37.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared: 0.967, Adjusted R-squared: 0.966
## F-statistic: 2.15e+03 on 3 and 223 DF, p-value: <2e-16
model.matrix(fit.continent.2)[5:10, ] # shows the indicator variables created
## ContinentAm ContinentAs ContinentE
## 5 0 1 0
## 6 0 1 0
## 7 0 0 1
## 8 0 0 1
## 9 0 0 1
## 10 0 0 1
Questions from the above fit:
contrast()
To estimate the mean difference between two regions, we can use the contrast() function from the package contrast.
contrast(fit.continent, list(Continent = 'As'), list(Continent = 'Am'))
contrast(fit.continent, list(Continent = 'As'), list(Continent = 'E')) # compare As with E

Changing base level
We can choose the base level as we want. We demonstrate this by setting up European cars as the base category:
data1$Continent <- factor(data1$Continent, levels = c("E", "Am", "As"))
summary(lm(MPG_City ~ Continent, data1))
##
## Call:
## lm(formula = MPG_City ~ Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.259 -2.730 -0.245 1.755 12.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.259 0.491 37.16 <2e-16 ***
## ContinentAm 0.470 0.646 0.73 0.4673
## ContinentAs 1.986 0.612 3.24 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared: 0.0552, Adjusted R-squared: 0.0467
## F-statistic: 6.52 on 2 and 223 DF, p-value: 0.00178
Note that:
# Change it back
data1$Continent <- factor(data1$Continent, levels = c("Am", "As", "E"))

Drawback of one-way ANOVA: other factors should also be taken into consideration.
Let us take one important variable, Horsepower, in addition to Continent. We first show three scatter plots by region with the fitted lm line added. Though the slopes appear somewhat different, for the sake of simple interpretation we fit the usual additive model without interaction; that is, we assume the effect of Horsepower on MPG_City is the same across regions.
ggplot(data1, aes(x = Horsepower, y = MPG_City, color=Continent)) +
geom_point() +
geom_smooth(method = "lm" , se =F) +
facet_wrap(~ Continent) + labs(title="City MPG by Horsepower")
## `geom_smooth()` using formula 'y ~ x'
# ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) +
# geom_point() +
# geom_smooth(method = "lm" , se =F) +
# labs(title = "Model With Interactions")We next plot our models without interactions.
fit.no.interation <- lm(MPG_City ~ Continent+Horsepower, data1)
coefs <- (summary(fit.no.interation))$coefficients[, 1]
ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) +
geom_point() +
geom_abline(intercept =coefs[1], slope=coefs[4], color ="#F8766D") +
geom_abline(intercept =coefs[1]+coefs[2], slope=coefs[4], color = "#00BA38") +
geom_abline(intercept =coefs[1]+coefs[3], slope=coefs[4], color ="#619CFF") +
labs(title = "Model Without Interactions")
We next plot our model without interactions. For a model without interaction, we assume that the effect of Horsepower on MPG_City is the same regardless of the region in which the cars are produced.
\[MPG = \beta_{Am} + \beta_{As} \cdot I(Continent = As) + \beta_E \cdot I(Continent = E) + \beta_1 \cdot HP + \epsilon\]
fit.no.interation <- lm(MPG_City ~ Horsepower + Continent, data1)
summary(fit.no.interation)
##
## Call:
## lm(formula = MPG_City ~ Horsepower + Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.948 -1.486 0.053 1.476 8.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.64616 0.62290 44.38 <2e-16 ***
## Horsepower -0.04212 0.00261 -16.11 <2e-16 ***
## ContinentAs 1.03893 0.37958 2.74 0.0067 **
## ContinentE 0.44071 0.44341 0.99 0.3213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.46 on 222 degrees of freedom
## Multiple R-squared: 0.564, Adjusted R-squared: 0.558
## F-statistic: 95.9 on 3 and 222 DF, p-value: <2e-16
To test whether Continent is significant after controlling for Horsepower, we can use anova(H_0, H_1).
anova(lm(MPG_City ~ Horsepower, data1), fit.no.interation)
## Analysis of Variance Table
##
## Model 1: MPG_City ~ Horsepower
## Model 2: MPG_City ~ Horsepower + Continent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 224 1386
## 2 222 1340 2 46 3.81 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 0.024: weak evidence against the null hypothesis that there is no Continent effect.
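The F statistic that anova() reports can be computed directly from the two residual sums of squares, \(F = \frac{(RSS_0 - RSS_1)/q}{RSS_1/df_1}\), where \(q\) is the number of extra parameters in the larger model. A self-contained sketch on simulated data (a toy continuous predictor and a three-level factor standing in for Horsepower and Continent):

```r
# Sketch: reproduce the nested-model F test by hand from the RSS's.
set.seed(2)
x <- rnorm(120)
g <- factor(sample(c("a", "b", "c"), 120, replace = TRUE))
y <- 1 + 0.5 * x + 1 * (g == "b") + rnorm(120)

h0 <- lm(y ~ x)        # null model
h1 <- lm(y ~ x + g)    # adds the 3-level factor (2 extra df)
rss0 <- sum(resid(h0)^2)
rss1 <- sum(resid(h1)^2)
q    <- h0$df.residual - h1$df.residual

Fstat <- ((rss0 - rss1) / q) / (rss1 / h1$df.residual)
pval  <- pf(Fstat, q, h1$df.residual, lower.tail = FALSE)
c(F = Fstat, p.value = pval)   # agrees with anova(h0, h1)
```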
Anova()
We can also use Anova() from the car package.
Anova(fit.no.interation) # Caution: anova(fit.no.interation) doesn't give us the right F test here! anova() is a sequential test: given the variables above, do we need the current one?
## Anova Table (Type II tests)
##
## Response: MPG_City
## Sum Sq Df F value Pr(>F)
## Horsepower 1567 1 259.47 <2e-16 ***
## Continent 46 2 3.81 0.024 *
## Residuals 1340 222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We fail to reject \(H_0\) at \(\alpha=0.01\); i.e., controlling for Horsepower, we do not have strong evidence of a Continent effect.
Remark: models without interactions are simple and let us test the Continent effect directly. But we should ask ourselves: is this assumption reasonable?
It is possible that HP effects depend on Continent.
ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) +
geom_point() +
geom_smooth(method = "lm" , se =F) +
labs(title = "Model With Interactions")## `geom_smooth()` using formula 'y ~ x'
As we can see from the plot, the slopes differ by continent: at smaller Horsepower the Asian cars are the most efficient, while at larger Horsepower the European cars are. To capture this effect, we introduce a model with interactions: MPG_City ~ Horsepower by Continent.
\[y_{i|Am,\, HP} = \beta_{Am} + \beta_{1_{Am}} \cdot HP + \epsilon_i\] \[y_{i|As,\, HP} = \beta_{As} + \beta_{1_{As}} \cdot HP + \epsilon_i\]
\[y_{i|E,\, HP} = \beta_{E} + \beta_{1_{E}} \cdot HP + \epsilon_i\]
This can be done through the following indicators:
\[MPG= \beta_{Am} + \beta_{As} \cdot I(Continent = As) + \beta_E \cdot I(Continent = E) + \\ \beta_{1_{Am}} \cdot HP + \beta_{1_{As}} \cdot HP \cdot I(Continent = As) + \beta_{1_E} \cdot HP \cdot I(Continent = E) + \epsilon\]
Similarly,
\(\beta_{As}\) is the increment of intercept between Asian and American, …
\(\beta_{1_{As}}\) is the increment of slope between Asian and American, …
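To see exactly which columns this formula creates, we can inspect model.matrix() on a tiny toy data frame (the three rows below are hypothetical values, not rows of the car data):

```r
# Sketch: the indicator and Horsepower-by-indicator columns of the
# interaction formula, shown on a 3-row toy data frame.
toy <- data.frame(
  Continent  = factor(c("Am", "As", "E"), levels = c("Am", "As", "E")),
  Horsepower = c(150, 200, 250)
)
mm <- model.matrix(~ Horsepower * Continent, toy)
mm
# columns: (Intercept), Horsepower, ContinentAs, ContinentE,
#          Horsepower:ContinentAs, Horsepower:ContinentE
```

The last two columns are Horsepower multiplied by the corresponding indicator, exactly the \(HP \cdot I(\cdot)\) terms in the formula above.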
Use lm() with interaction:
fit.with.interaction <- lm(MPG_City ~ Horsepower*Continent, data1)
summary(fit.with.interaction) # model.matrix(fit.with.interaction)[1:3, ]
##
## Call:
## lm(formula = MPG_City ~ Horsepower * Continent, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.792 -1.596 -0.021 1.610 7.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.60189 1.04938 25.35 < 2e-16 ***
## Horsepower -0.03718 0.00478 -7.78 2.8e-13 ***
## ContinentAs 4.40301 1.35269 3.25 0.0013 **
## ContinentE -0.73505 1.51377 -0.49 0.6278
## Horsepower:ContinentAs -0.01651 0.00629 -2.63 0.0092 **
## Horsepower:ContinentE 0.00458 0.00654 0.70 0.4842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.4 on 220 degrees of freedom
## Multiple R-squared: 0.59, Adjusted R-squared: 0.58
## F-statistic: 63.3 on 5 and 220 DF, p-value: <2e-16
# summary(lm(MPG_City ~ Horsepower + Continent + Horsepower*Continent, data1)) # this is the same as fit.with.interaction
Question: Is the interaction effect significant? Are the effects of Horsepower the same across continents?
Testing that there is no interaction is the same as testing \[H_0: \beta_{1_{As}} = \beta_{1_E} = 0,\] which we can do with anova():
anova(fit.no.interation, fit.with.interaction)
## Analysis of Variance Table
##
## Model 1: MPG_City ~ Horsepower + Continent
## Model 2: MPG_City ~ Horsepower * Continent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 222 1340
## 2 220 1262 2 78.3 6.82 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit.with.interaction)
## Anova Table (Type II tests)
##
## Response: MPG_City
## Sum Sq Df F value Pr(>F)
## Horsepower 1567 1 273.07 <2e-16 ***
## Continent 46 2 4.01 0.0196 *
## Horsepower:Continent 78 2 6.82 0.0013 **
## Residuals 1262 220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis with a \(p\)-value of \(.0013\), not terribly small. That indicates some interaction effect between Horsepower and Continent.
So the final conclusion would be the model with interactions, although its interpretation is no longer as neat: the effect of Horsepower now depends on Continent.
We now try a model which includes all sensible variables.
names(data1)
names.exclude <- names(data1) %in% c("Make.Model", "MPG_Hwy", "Make")
# names.exclude <- c("Make.Model", "MPG_Hwy", "Make") # doesn't work: a character vector can't be negated with !
data2 <- data1[!names.exclude] # Take a subset with all var's but "Make.Model","MPG_Hwy","Make"
names(data2) # str(data2); table(data2$Transmission)
Or use dplyr:
data2 <- select(data1, -Make.Model, -MPG_Hwy, -Make)
names(data2)## [1] "Continent" "MPG_City" "Horsepower" "Weight"
## [5] "Length" "Width" "Seating" "Cylinders"
## [9] "Displacement" "Transmission"
QUESTION: What would have happened if you had added Make.Model into the model?
fit.all <- lm(MPG_City ~., data2) #convenient way to include all variables
summary(fit.all)
##
## Call:
## lm(formula = MPG_City ~ ., data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.275 -1.024 0.038 0.913 6.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.54629 3.78785 7.80 2.7e-13 ***
## ContinentAs 0.46536 0.29121 1.60 0.11151
## ContinentE 0.05009 0.42770 0.12 0.90688
## Horsepower -0.01331 0.00389 -3.42 0.00075 ***
## Weight -3.50280 0.38932 -9.00 < 2e-16 ***
## Length 0.03313 0.01492 2.22 0.02737 *
## Width -0.01212 0.06742 -0.18 0.85749
## Seating 0.31648 0.14234 2.22 0.02724 *
## Cylinders -0.19980 0.20535 -0.97 0.33166
## Displacement -0.16762 0.39266 -0.43 0.66988
## Transmissioncont_variable 1.22163 0.99566 1.23 0.22119
## Transmissionmanual -0.24546 0.55587 -0.44 0.65924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.68 on 214 degrees of freedom
## Multiple R-squared: 0.803, Adjusted R-squared: 0.793
## F-statistic: 79.2 on 11 and 214 DF, p-value: <2e-16
data3 <- data2
data3$Cylinders <- as.factor(data3$Cylinders)
Anova(lm(MPG_City ~., data3)) # interestingly, Cylinders is very significant as a categorical variable
## Anova Table (Type II tests)
##
## Response: MPG_City
## Sum Sq Df F value Pr(>F)
## Continent 10 2 2.13 0.121
## Horsepower 15 1 6.42 0.012 *
## Weight 212 1 90.03 < 2e-16 ***
## Length 13 1 5.35 0.022 *
## Width 1 1 0.30 0.584
## Seating 13 1 5.67 0.018 *
## Cylinders 113 4 12.07 7.5e-09 ***
## Displacement 6 1 2.67 0.104
## Transmission 8 2 1.68 0.189
## Residuals 496 211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(t\)-table is pretty messy and many variables are not significant, so we turn to Anova():
Anova(fit.all)
## Anova Table (Type II tests)
##
## Response: MPG_City
## Sum Sq Df F value Pr(>F)
## Continent 10 2 1.79 0.16972
## Horsepower 33 1 11.71 0.00075 ***
## Weight 229 1 80.95 < 2e-16 ***
## Length 14 1 4.93 0.02737 *
## Width 0 1 0.03 0.85749
## Seating 14 1 4.94 0.02724 *
## Cylinders 3 1 0.95 0.33166
## Displacement 1 1 0.18 0.66988
## Transmission 5 2 0.89 0.41282
## Residuals 606 214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comments:
Continent is not needed after accounting for all other variables. That is, we don't have evidence that one continent tends to produce more fuel-efficient cars than the others.
Transmission: the cars are predominantly automatic. Perhaps we should include only automatic cars and change our population.
table(data2$Transmission)
##
## automatic cont_variable manual
## 211 3 12
Model diagnostics: check whether the three assumptions of the linear model are met.
Residuals vs Fitted and normal Q-Q plots; both look fine.
par(mfrow=c(1,2))
plot(fit.all, 1)
plot(fit.all, 2)
We now provide a confidence interval for the mean of cars like our newcar:
fit.mean.new <- predict(fit.all, newcar, interval = "confidence")
fit.mean.new
## fit lwr upr
## 1 17.7 16.3 19.1
We also provide a prediction interval for this newcar:
fit.new <- predict(fit.all, newcar, interval = "prediction", se.fit=TRUE)
fit.new$fit
## fit lwr upr
## 1 17.7 14.1 21.3
fit.new$se.fit
## [1] 0.702
fit.new$residual.scale
## [1] 1.68
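The gap between the two intervals is not an accident: the prediction interval adds the noise variance to the uncertainty in the estimated mean, so it is always wider than the confidence interval at the same \(x\). A minimal sketch on simulated data (not the car data):

```r
# Sketch: confidence vs prediction interval at the same x value.
set.seed(3)
x   <- runif(100)
y   <- 1 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
new <- data.frame(x = 0.5)

ci <- predict(fit, new, interval = "confidence")
pi <- predict(fit, new, interval = "prediction")
ci  # narrow: uncertainty in the estimated mean only
pi  # wider: adds the residual variance of a new observation
```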
Summary of linear models:
Linear models are simple and have nice interpretations
The coefficient of each predictor depends on what else is in the model
Least squares automatically adjusts for the units of the predictors
\(t\)-tests: the effect of each coefficient (they do not rank the importance of the predictors)
\(F\)-tests: test a set of predictors (\(t\)-tests being special cases!)
Model assumptions: all checked through the residuals
The most damaging model violation is the presence of heteroscedasticity: it threatens the validity of every test. But we can use sandwich estimators to get valid standard errors. See the section Heteroscedasticity in the appendix.
Last but not least: linear models used here only show associations between predictors and the response on average. We CANNOT conclude causality at all.
BIGGEST QUESTION: How do we identify a set of important variables, or decide which model to use?
Choosing models with small "prediction errors" is one way to go. We will discuss this further in the lecture on model selection.
We show how we cleaned the dataset Cars_04.csv into car_04_regular.csv.
# Caution: you need to have Cars_04.csv in order to run this chunk!!!!!
data = read.csv("data/Cars_04.csv", header=T, as.is=F)
str(data) # There are 242 cars in the data
data.comp <- na.omit(data) # It deletes any car with a missing value - not a good idea.
str(data.comp) #182 cars with complete records for all the variables.
## Explore the data
names(data)
head(data)
summary(data)
par(mgp=c(1.8,.5,0), mar=c(3,3,2,1))
hist(data$Horsepower, breaks=20, col="blue") # notice that there are some super powerful cars
## Let's find out which are the super powerful cars
data$Make.Model[data$Horsepower > 400] # Show models with Horsepower > 400
data$Make.Model[data$Horsepower > 390]
data$Make.Model[data$Horsepower <= 100]
## We could find cars with horsepower > 400
data[data$Horsepower > 400, "Make.Model"]
## Let's concentrate on "regular" cars and exclude those super expensive ones
datacar <- data[(data$Horsepower <=390) & (data$Horsepower > 100), ]
## Take a subset with relevant variables
variables <- c("Make.Model", "Continent", "MPG_City", "MPG_Hwy",
"Horsepower","Weight","Length", "Width",
"Seating","Cylinders","Displacement",
"Make", "Transmission")
data1 <- datacar[, variables] # subset
str(data1)
write.csv(data1, file="car_04_regular.csv", row.names = FALSE)
# Output the data and name it car_04_regular.csv
Take fit3 and recreate \(\hat\beta\)'s.
design.x <- model.matrix(fit3) #fit3$model gives us Y and x1 x2...xp
y <- fit3$model[, 1]
design.x[, 1]
rse <- (summary(fit3))$sigma
beta <- (solve( t(design.x) %*% design.x)) %*% t(design.x) %*% y # reconstruct LS estimates
beta
Reconstructing the covariance matrix from fit3:
summary(fit3)$cov.unscaled # inverse (X' X)
cov.beta <- (rse^2) * (summary(fit3)$cov.unscaled) # alternatively we can get the cov matrix this way
Calculating the standard errors from fit3:
sd.beta <- sqrt(diag(cov.beta)) # check to see this agrees with the sd for each betahat
sd.beta # this should be the same as the Std. Error column in the summary table
summary(fit3)$coeff[, "Std. Error"]
Sandwich estimator:
In a linear model, when \(\text{Var}(y_i| x_i)=\sigma^2(x_i)\), the model is heteroscedastic. The least squares estimator of \(\beta\) is still unbiased, but the usual variance formula no longer holds: \(\text{Var}( {\bf\hat{\beta}}) \neq \sigma^2 (X^T X)^{-1}\). It has long been known in econometrics that we can correct the variance estimator with a sandwich estimator, since \[\text{Var}( {\bf\hat{\beta}}) = (X^T X)^{-1}(X^T D X) (X^T X)^{-1}\] where \(D\) is the diagonal matrix of \(d_i=\sigma^2_i=\text{Var}(y_i|x_i)\). We of course do not know the \(d_i\), so we estimate each by the squared residual: \(\hat d_i = (y_i-\hat y_i)^2\). That yields the famous sandwich estimator due to White (1980): \[\widehat{\text{Var}}( {\bf\hat{\beta}}) = (X^T X)^{-1}(X^T \hat D X) (X^T X)^{-1}\] This correction has been implemented in various packages. We use sandwich with the function lmtest::coeftest(fm, df = Inf, vcov = vcovHC). HC stands for Heteroscedasticity-Consistent covariance matrix estimation.
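The sandwich formula can be checked by hand. A sketch on simulated heteroscedastic data computing White's HC0 estimator in base R (the sandwich package automates this, with further finite-sample refinements):

```r
# Sketch: White's HC0 sandwich estimator, (X'X)^{-1} (X' D_hat X) (X'X)^{-1},
# with d_i estimated by the squared residual e_i^2.
set.seed(4)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.3 + 1.5 * x)  # noise sd grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e^2)    # X' diag(e^2) X: each row of X scaled by e_i^2
V.hc0 <- bread %*% meat %*% bread

sqrt(diag(V.hc0))  # robust SEs; compare with the usual summary(fit) SEs
```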
To produce HC-corrected confidence intervals for \(\beta\), first get a linear fit, fit <- lm(y ~ ...), then apply lmtest::coeftest(fit, df = Inf, vcov = vcovHC) to get corrected confidence intervals for \(\beta\).
Remark: even when linearity does not hold, we can still fit a linear model and interpret the linear function as an approximation to the mean. The confidence intervals obtained using the sandwich estimators remain valid. See our recent work Models as Approximations I: Consequences Illustrated with Linear Regression.
Let us revisit the final model with all variables:
data2 <- select(data1, -Make.Model, -MPG_Hwy, -Make)
fit.all <- lm(MPG_City ~., data2)
Anova(fit.all)## Anova Table (Type II tests)
##
## Response: MPG_City
## Sum Sq Df F value Pr(>F)
## Continent 10 2 1.79 0.16972
## Horsepower 33 1 11.71 0.00075 ***
## Weight 229 1 80.95 < 2e-16 ***
## Length 14 1 4.93 0.02737 *
## Width 0 1 0.03 0.85749
## Seating 14 1 4.94 0.02724 *
## Cylinders 3 1 0.95 0.33166
## Displacement 1 1 0.18 0.66988
## Transmission 5 2 0.89 0.41282
## Residuals 606 214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit.all, 1)
From the residual plot, one may worry about a violation of linearity, as well as the presence of heteroscedasticity.
To provide a more trustworthy analysis, we may use sandwich standard errors for more accurate inference on the coefficients.
#sandwich(fit.all) # gives us HC cov estimator
#vcovHC(fit.all) # also gives us the HC cov matrix
fit.all.hc <- lmtest::coeftest(fit.all, df = Inf, vcov = vcovHC)
fit.all.hc # output
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.54629 3.62610 8.15 3.7e-16 ***
## ContinentAs 0.46536 0.26515 1.76 0.0793 .
## ContinentE 0.05009 0.41556 0.12 0.9041
## Horsepower -0.01331 0.00483 -2.76 0.0058 **
## Weight -3.50280 0.39419 -8.89 < 2e-16 ***
## Length 0.03313 0.01767 1.87 0.0608 .
## Width -0.01212 0.05920 -0.20 0.8378
## Seating 0.31648 0.15399 2.06 0.0399 *
## Cylinders -0.19980 0.23428 -0.85 0.3938
## Displacement -0.16762 0.40519 -0.41 0.6791
## Transmissioncont_variable 1.22163 0.62256 1.96 0.0497 *
## Transmissionmanual -0.24546 0.87440 -0.28 0.7789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us compare the regular lm output with the one using the sandwich covariance estimator:
stargazer(fit.all, fit.all.hc, type = output_format)
| | Dependent variable: MPG_City | |
|---|---|---|
| | OLS (1) | coefficient test (2) |
| ContinentAs | 0.465 | 0.465* |
| (0.291) | (0.265) | |
| ContinentE | 0.050 | 0.050 |
| (0.428) | (0.416) | |
| Horsepower | -0.013*** | -0.013*** |
| (0.004) | (0.005) | |
| Weight | -3.500*** | -3.500*** |
| (0.389) | (0.394) | |
| Length | 0.033** | 0.033* |
| (0.015) | (0.018) | |
| Width | -0.012 | -0.012 |
| (0.067) | (0.059) | |
| Seating | 0.316** | 0.316** |
| (0.142) | (0.154) | |
| Cylinders | -0.200 | -0.200 |
| (0.205) | (0.234) | |
| Displacement | -0.168 | -0.168 |
| (0.393) | (0.405) | |
| Transmissioncont_variable | 1.220 | 1.220** |
| (0.996) | (0.623) | |
| Transmissionmanual | -0.245 | -0.245 |
| (0.556) | (0.874) | |
| Constant | 29.500*** | 29.500*** |
| (3.790) | (3.630) | |
| Observations | 226 | |
| R2 | 0.803 | |
| Adjusted R2 | 0.793 | |
| Residual Std. Error | 1.680 (df = 214) | |
| F Statistic | 79.200*** (df = 11; 214) | |
| Note: | p<0.1; p<0.05; p<0.01 | |
While the sandwich standard errors remain more or less the same as before, we do notice one difference: Asian cars now appear more efficient than American cars at \(\alpha = .1\).
The coefficient estimates themselves remain the same!
cov <- vcovHC(fit.all, type = "HC")
robust.se <- sqrt(diag(cov))
stargazer(fit.all, fit.all.hc, type = output_format,
se = list(NULL, robust.se))
Remark: how do we incorporate heteroscedasticity into prediction intervals? One easy way out is to fit another model with the squared residuals as the response, and use its predicted values as the variance given \(x\). We could also use a more relaxed non-linear model such as trees/random forests.
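A minimal sketch of this remark on simulated data. For simplicity it uses a second lm as the variance model (a tree or random forest could replace it, as suggested above) and a plain 1.96 normal cutoff, ignoring estimation uncertainty:

```r
# Sketch: x-dependent prediction intervals from a crude variance model
# fitted to the squared residuals.
set.seed(5)
x <- runif(300)
y <- 1 + 2 * x + rnorm(300, sd = 0.2 + x)    # heteroscedastic noise
fit.mean <- lm(y ~ x)
fit.var  <- lm(resid(fit.mean)^2 ~ x)        # crude model for Var(y | x)

new    <- data.frame(x = c(0.1, 0.9))
mu.hat <- predict(fit.mean, new)
s2.hat <- pmax(predict(fit.var, new), 0)     # guard against negative fitted variances

cbind(lwr = mu.hat - 1.96 * sqrt(s2.hat),
      upr = mu.hat + 1.96 * sqrt(s2.hat))    # interval width now varies with x
```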
The \(x\) variable may relate to \(y\) through a function of \(x\), say \(\log(x)\) or \(x^{2}\).
Based on the following pairs plot, Horsepower may need a transformation.
pairs(data1[, -1])
We look at MPG_City against Horsepower and against \(\frac{1}{Horsepower}\):
par(mfrow=c(1,2))
plot(data1$Horsepower, data1$MPG_City, pch=16)
plot(1/data1$Horsepower, data1$MPG_City, pch=16) # looks better!
\(\frac{1}{Horsepower}\) looks good, so we fit a model using it.
fit.transform <- lm(MPG_City~ I(1/Horsepower), data1) # Use I()
plot(fit.transform, 1)
We also try \(Horsepower^{2}\):
fit12 <- lm(MPG_City ~ Horsepower + I(Horsepower^2), data1) # fit a quadratic function
summary(fit12)
plot(fit12$fitted, fit12$residuals, pch=16)
abline(h=0, lwd=4, col="red")
plot(fit12, 2)
We now simulate 100 points from the model
\[y = 1 +2x + \mathcal{N}(0,2) \quad i.e. \quad \beta_0 = 1 ,\beta_1 = 2, \sigma^2 = 2.\]
par(mfrow=c(1,1))
x <- runif(100)
y <- 1+2*x+rnorm(100,0, 2)
fit <- lm(y~x)
fit.perfect <- summary(lm(y~x))
rsquared <- round(fit.perfect$r.squared,2)
hat_beta_0 <- round(fit.perfect$coefficients[1], 2)
hat_beta_1 <- round(fit.perfect$coefficients[2], 2)
plot(x, y, pch=16,
ylim=c(-8,8),
xlab="a perfect linear model: true mean: y=1+2x in blue, LS in red",
main=paste("R squared= ",rsquared,
", LS estimates b1=",hat_beta_1, "and b0=", hat_beta_0))
abline(fit, lwd=4, col="red")
lines(x, 1+2*x, lwd=4, col="blue")
Residual plot:
plot(fit$fitted, fit$residuals, pch=16,
ylim=c(-8, 8),
main="residual plot")
abline(h=0, lwd=4, col="red")
Check normality:
qqnorm(fit$residuals, ylim=c(-8, 8))
qqline(fit$residuals, lwd=4, col="blue")
The following example shows that we can't look at y directly to check the normality assumption. Why not? We really need to examine the residuals!
par(mfrow=c(2,2))
x <- runif(1000)
y <- 1+20*x+rnorm(1000,0, 2)
plot(x, y, pch=16)
hist(y, breaks=40)
qqnorm(y, main="qqnorm for y")
qqline(y)
fit <- lm(y~x)
qqnorm(fit$residuals, main="qqnorm for residuals")
qqline(fit$residuals)
par(mfrow=c(1,1))
data.frame(mean = mean(y), std.dev = sd(y))
When some \(x\)'s are highly correlated we cannot separate their effects, but the model is still fine for the purpose of prediction.
A simulation to illustrate some consequences of collinearity: the \(p\)-value for each of \(x_1\) and \(x_2\) is large, yet the null that both \(\beta\)'s \(= 0\) is rejected, because \(x_1\) and \(x_2\) are highly correlated.
par(mfrow=c(2,1))
set.seed(1)
x1 <- runif(100)
x2 <- 2*x1+rnorm(100, 0,.1) # x1 and x2 are highly correlated
y <- 1+2*x1+rnorm(100,0, .7) # model
newdata.cor <- cbind(y, x1,x2) # to see the strong correlations..
pairs(newdata.cor, pch=16)
\(x_1\) alone is a useful predictor:
summary(lm(y~x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0132 -0.4792 -0.0524 0.4297 1.6787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.082 0.161 6.72 1.2e-09 ***
## x1 1.872 0.277 6.76 1.0e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.737 on 98 degrees of freedom
## Multiple R-squared: 0.318, Adjusted R-squared: 0.311
## F-statistic: 45.7 on 1 and 98 DF, p-value: 9.96e-10
\(x_2\) alone is a useful predictor:
summary(lm(y~x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9509 -0.5022 -0.0132 0.4831 1.6305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112 0.157 7.08 2.2e-10 ***
## x2 0.909 0.134 6.77 9.4e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.737 on 98 degrees of freedom
## Multiple R-squared: 0.319, Adjusted R-squared: 0.312
## F-statistic: 45.9 on 1 and 98 DF, p-value: 9.39e-10
\(x_1\) and \(x_2\) together: neither appears significant on its own
summary(lm(y~x1+x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9818 -0.5091 -0.0376 0.4437 1.6351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.091 0.162 6.72 1.2e-09 ***
## x1 0.863 1.636 0.53 0.60
## x2 0.497 0.794 0.63 0.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.739 on 97 degrees of freedom
## Multiple R-squared: 0.321, Adjusted R-squared: 0.307
## F-statistic: 22.9 on 2 and 97 DF, p-value: 7.1e-09
Putting both highly correlated variables together, we can't separate the effect of each one, though the model is still useful!
data.frame(Correlation = cor(x1, x2))
## Correlation
## 1 0.985
Model diagnostics confirm the validity of the model assumptions.
plot(lm(y~x1+x2), 1:2)
A quick look at an \(F\) distribution. One may change \(df_1 = 6\) and \(df_2 = 219\) to see how the distribution changes.
##
hist(rf(10000, 6, 226-7), freq=FALSE, breaks=200) # pretty skewed to the right
Fstat = (summary(fit3))$fstat # the F statistic with df1 and df2
pvalue=1-pf(Fstat[1], 6, 219) #or pf(Fstat[1], 6, 219, lower.tail=FALSE)
# As long as Fstat is larger than 2.14, we reject the null at .05 level.
data.frame(F_stat = Fstat[1] , pvalue = pvalue , Cutoff = qf(.95, 6, 219))